data pipeline
ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking
Lin, Lequan, Shi, Dai, Han, Andi, Chen, Feng, Chen, Qiuzheng, Li, Jiawen, Li, Zhaoyang, Li, Jiyuan, Sun, Zhenbang, Gao, Junbin
Supervised learning relies on high-quality labeled data, but obtaining such data through human annotation is both expensive and time-consuming. Recent work explores using large language models (LLMs) for annotation, but LLM-generated labels still fall short of human-level quality. To address this problem, we propose the Annotation with Critical Thinking (ACT) data pipeline, where LLMs serve not only as annotators but also as judges to critically identify potential errors. Human effort is then directed towards reviewing only the most "suspicious" cases, significantly improving the human annotation efficiency. Our major contributions are as follows: (1) ACT is applicable to a wide range of domains, including natural language processing (NLP), computer vision (CV), and multimodal understanding, by leveraging multimodal-LLMs (MLLMs). (2) Through empirical studies, we derive 7 insights on how to enhance annotation quality while efficiently reducing the human cost, and then translate these findings into user-friendly guidelines. (3) We theoretically analyze how to modify the loss function so that models trained on ACT data achieve similar performance to those trained on fully human-annotated data. Our experiments show that the performance gap can be reduced to less than 2% on most benchmark datasets while saving up to 90% of human costs.
- North America > Canada > Ontario > Toronto (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- (4 more...)
Proportion and Perspective Control for Flow-Based Image Generation
Boudier, Julien, Caselles-Dupré, Hugo
While modern text-to-image diffusion models generate high-fidelity images, they offer limited control over the spatial and geometric structure of the output. To address this, we introduce and evaluate two ControlNets specialized for artistic control: (1) a proportion ControlNet that uses bounding boxes to dictate the position and scale of objects, and (2) a perspective ControlNet that employs vanishing lines to control the 3D geometry of the scene. We support the training of these modules with data pipelines that leverage vision-language models for annotation and specialized algorithms for conditioning image synthesis. Our experiments demonstrate that both modules provide effective control but exhibit limitations with complex constraints. Both models are released on HuggingFace: https://huggingface.co/obvious-research
Safe, Untrusted, "Proof-Carrying" AI Agents: toward the agentic lakehouse
Tagliabue, Jacopo, Greco, Ciro
Starting from this prototype, we conclude by outlining practical next steps for a full agentic lakehouse. The paper is organized as follows. After reviewing agent-friendly abstractions (Section II), we address key safety objections for high-stakes scenarios (Section III). Once safety is established, we describe a ReAct [12] loop built on these abstractions (Section IV). We put forward our working prototype as a feasibility demonstration of safe-by-design data agents, not as a full-fledged experimental benchmark. We believe that sharing working code is of great value to the community, especially in times of quickly shifting mental models. However, it is important to remember that our fundamental insights - programmability and safety - can be replicated independently of the chosen APIs. For these reasons, we believe our paper to be valuable to a wide range of practitioners: on one hand, those looking for a new mental map of this uncharted territory; on the other, those looking to be inspired by tinkering with existing implementations and inspecting systems working at scale.
- North America > United States (0.05)
- Europe > Germany (0.04)
InstaGeo: Compute-Efficient Geospatial Machine Learning from Data to Deployment
Yusuf, Ibrahim Salihu, Houndayi, Iffanice, Oualha, Rym, Cherif, Mohamed Aziz, Panford-Quainoo, Kobby, Pretorius, Arnu
Open-access multispectral imagery from missions like Landsat 8-9 and Sentinel-2 has fueled the development of geospatial foundation models (GFMs) for humanitarian and environmental applications. Yet, their deployment remains limited by (i) the absence of automated geospatial data pipelines and (ii) the large size of fine-tuned models. Existing GFMs lack workflows for processing raw satellite imagery, and downstream adaptations often retain the full complexity of the original encoder. We present InstaGeo, an open-source, end-to-end framework that addresses these challenges by integrating: (1) automated data curation to transform raw imagery into model-ready datasets; (2) task-specific model distillation to derive compact, compute-efficient models; and (3) seamless deployment as interactive web-map applications. Using InstaGeo, we reproduced datasets from three published studies and trained models with marginal mIoU differences of -0.73 pp for flood mapping, -0.20 pp for crop segmentation, and +1.79 pp for desert locust prediction. The distilled models are up to 8x smaller than standard fine-tuned counterparts, reducing FLOPs and CO2 emissions with minimal accuracy loss. Leveraging InstaGeo's streamlined data pipeline, we also curated a larger crop segmentation dataset, achieving a state-of-the-art mIoU of 60.65%, a 12 pp improvement over prior baselines. Moreover, InstaGeo enables users to progress from raw data to model deployment within a single working day. By unifying data preparation, model compression, and deployment, InstaGeo transforms research-grade GFMs into practical, low-carbon tools for real-time, large-scale Earth observation. This approach shifts geospatial AI toward data quality and application-driven innovation. Source code, datasets, and model checkpoints are available at: https://github.com/instadeepai/InstaGeo-E2E-Geospatial-ML.git
Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
Zheng, Tong, Zhang, Hongming, Yu, Wenhao, Wang, Xiaoyang, Dai, Runpeng, Liu, Rui, Bao, Huiwen, Huang, Chengsong, Huang, Heng, Yu, Dong
Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. Different from them, we propose \textbf{Parallel-R1}, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model's thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a \textbf{mid-training exploration scaffold}, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-source at https://github.com/zhengkid/Parallel-R1.
- Asia > Middle East > Jordan (0.04)
- North America > United States > North Carolina (0.04)
- North America > United States > Maryland > Prince George's County > College Park (0.04)
- (2 more...)
Describe Anything: Detailed Localized Image and Video Captioning
Lian, Long, Ding, Yifan, Ge, Yunhao, Liu, Sifei, Mao, Hanzi, Li, Boyi, Pavone, Marco, Liu, Ming-Yu, Darrell, Trevor, Yala, Adam, Cui, Yin
Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
Automated Planning for Optimal Data Pipeline Instantiation
Amado, Leonardo Rosa, Vogel, Adriano, Griebler, Dalvan, Licks, Gabriel Paludo, Simon, Eric, Meneguzzi, Felipe
Data pipeline frameworks provide abstractions for implementing sequences of data-intensive transformation operators, automating the deployment and execution of such transformations in a cluster. Deploying a data pipeline, however, requires computing resources to be allocated in a data center, ideally minimizing the overhead for communicating data and executing operators in the pipeline while considering each operator's execution requirements. In this paper, we model the problem of optimal data pipeline deployment as planning with action costs, where we propose heuristics aiming to minimize total execution time. Experimental results indicate that the heuristics can outperform the baseline deployment and that a heuristic based on connections outperforms other strategies.
- South America > Brazil > Rio Grande do Sul (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (4 more...)
V2X-LLM: Enhancing V2X Integration and Understanding in Connected Vehicle Corridors
Wu, Keshu, Li, Pei, Zhou, Yang, Gan, Rui, You, Junwei, Cheng, Yang, Zhu, Jingwen, Parker, Steven T., Ran, Bin, Noyce, David A., Tu, Zhengzhong
The advancement of Connected and Automated Vehicles (CAVs) and Vehicle-to-Everything (V2X) offers significant potential for enhancing transportation safety, mobility, and sustainability. However, the integration and analysis of the diverse and voluminous V2X data, including Basic Safety Messages (BSMs) and Signal Phase and Timing (SPaT) data, present substantial challenges, especially on Connected Vehicle Corridors. These challenges include managing large data volumes, ensuring real-time data integration, and understanding complex traffic scenarios. Although these projects have developed an advanced CAV data pipeline that enables real-time communication between vehicles, infrastructure, and other road users for managing connected vehicle and roadside unit (RSU) data, significant hurdles in data comprehension and real-time scenario analysis and reasoning persist. To address these issues, we introduce the V2X-LLM framework, a novel enhancement to the existing CV data pipeline. V2X-LLM leverages Large Language Models (LLMs) to improve the understanding and real-time analysis of V2X data. The framework includes four key tasks: Scenario Explanation, offering detailed narratives of traffic conditions; V2X Data Description, detailing vehicle and infrastructure statuses; State Prediction, forecasting future traffic states; and Navigation Advisory, providing optimized routing instructions. By integrating LLM-driven reasoning with V2X data within the data pipeline, the V2X-LLM framework offers real-time feedback and decision support for traffic management. This integration enhances the accuracy of traffic analysis, safety, and traffic optimization. Demonstrations in a real-world urban corridor highlight the framework's potential to advance intelligent transportation systems.
- Asia > China (0.46)
- North America > United States > Wisconsin > Dane County > Madison (0.15)
- North America > United States > Texas > Brazos County > College Station (0.14)
- (2 more...)
- Transportation > Ground > Road (1.00)
- Consumer Products & Services > Travel (0.68)
- Transportation > Infrastructure & Services (0.68)
LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models
Kim, Yungi, Ha, Hyunsoo, Yang, Seonghoon, Lee, Sukyung, Kim, Jihoo, Park, Chanjun
Creating high-quality, large-scale datasets for large language models (LLMs) often relies on resource-intensive, GPU-accelerated models for quality filtering, making the process time-consuming and costly. This dependence on GPUs limits accessibility for organizations lacking significant computational infrastructure. To address this issue, we introduce the Lightweight, Purpose-driven (LP) Data Pipeline, a framework that operates entirely on CPUs to streamline the processes of dataset extraction, filtering, and curation. Based on our four core principles, the LP Data Pipeline significantly reduces preparation time and cost while maintaining high data quality. Importantly, our pipeline enables the creation of purpose-driven datasets tailored to specific domains and languages, enhancing the applicability of LLMs in specialized contexts. We anticipate that our pipeline will lower the barriers to LLM development, enabling a wide range of organizations to access LLMs more easily.
ParaGAN: A Scalable Distributed Training Framework for Generative Adversarial Networks
Shi, Ziji, Li, Jialin, You, Yang
Recent advances in Generative Artificial Intelligence have fueled numerous applications, particularly those involving Generative Adversarial Networks (GANs), which are essential for synthesizing realistic photos and videos. However, efficiently training GANs remains a critical challenge due to their computationally intensive and numerically unstable nature. Existing methods often require days or even weeks for training, posing significant resource and time constraints. In this work, we introduce ParaGAN, a scalable distributed GAN training framework that leverages asynchronous training and an asymmetric optimization policy to accelerate GAN training. ParaGAN employs a congestion-aware data pipeline and hardware-aware layout transformation to enhance accelerator utilization, resulting in over 30% improvements in throughput. With ParaGAN, we reduce the training time of BigGAN from 15 days to 14 hours while achieving 91% scaling efficiency. Additionally, ParaGAN enables unprecedented high-resolution image generation using BigGAN.
- North America > United States > Washington > King County > Redmond (0.06)
- Asia > Singapore > Central Region > Singapore (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Information Technology (0.46)
- Media (0.34)